Chapter 3 - Strings

This notebook uses code snippets and explanations from this course.

The first thing you learned was printing a simple sentence: "Hello, world!" This sentence, like any text, is stored by Python as a string. Since many disciplines within the Humanities and Social Sciences work with texts, we will focus a lot on manipulating texts in this course. Strings will therefore be an important data type for us, and this notebook is devoted to them.

At the end of this chapter, you will be able to:

  • define strings and understand their internal representation
  • understand strings as sequences
  • use character indices for string slicing
  • combine strings through printing, concatenation and insertion
  • compare strings using comparison operators and the in operator
  • understand strings as immutable objects
  • work with and understand string methods
  • understand the difference between args and kwargs

If you have questions about this chapter, please refer to the forum on Canvas.

1. Defining and representing strings

A string is a sequence of characters (letters, digits, punctuation, spaces and so on). In Python, you create a string by enclosing text in single or double quotes. Let's define a few of them:


In [ ]:
# Here are some strings:
string_1 = "Hello, world!"
string_2 = 'I ❤️ cheese'      # If you are using Python 2, your computer will not like this.
string_3 = '1,2,3,4,5,6,7,8,9'

There is no difference in declaring a string with single or double quotes. However, if your string contains a quote symbol it can lead to errors if you try to enclose it with the same quotes.


In [ ]:
# Run this cell to see the error generated by the following line.
restaurant = 'Wendy's'

In the example above, Python raises a SyntaxError (the exact wording of the message depends on your Python version). This is because the second single quote closes the string we started, and anything after that is unexpected. To solve this, we can enclose the string in double quotes, as follows:


In [ ]:
restaurant = "Wendy's"
# Similarly, we can enclose a string containing double quotes with single quotes:
quotes = 'Using "double" quotes enclosed by a single quote.'

We can also use the escape character "\" in front of the quote, which will tell Python not to treat this specific quote as the end of the string.


In [ ]:
restaurant = 'Wendy\'s'
print(restaurant)
restaurant = "Wendy\"s"
print(restaurant)

1.1 Multi-line strings

Strings in Python can also span multiple lines, which is useful when you have a very long string, or when you want to format the output of the string in a certain way. This can be achieved in two ways:

  1. With single or double quotes, where we manually indicate that the rest of the string continues on the next line with a backslash.
  2. With three single or double quotes.

We will first demonstrate how this would work when you use one double or single quote.


In [ ]:
# This example also works with single-quotes.
long_string = "A very long string\n\
can be split into multiple\n\
sentences by appending a newline symbol\n\
to the end of the line."

print(long_string)

The \n (newline) symbol indicates that the text that follows should start on a new line when the string is printed. The backslash \ at the very end of each line of code tells Python that the string continues on the next line of code. This difference can be hard to grasp, but it is best illustrated with an example in which we leave out the \n symbol:


In [ ]:
long_string = "A very long string \
can be split into multiple \
sentences by appending a backslash \
to the end of the line."

print(long_string)

As you can see, Python now interprets this example as a single line of text. If we instead use the recommended way to write multi-line strings in Python, with three double or three single quotes, every line break you type inside the string is automatically stored as a \n (newline) character.


In [ ]:
long_string = """A very long string
can also be split into multiple 
sentences by enclosing the string
with three double or single quotes."""

print(long_string)

print()

another_long_string = '''A very long string
can also be split into multiple 
sentences by enclosing the string
with three double or single quotes.'''

print(another_long_string)

What will happen if you remove the backslash characters in the example? Try it out in the cell below.


In [ ]:
long_string = "A very long string\
can be split into multiple\
sentences by appending a backslash\
to the end of the line."

print(long_string)

1.2 Internal representation: using repr()

As we have seen above, it is possible to make strings that span multiple lines. Here are two ways to do so:


In [ ]:
multiline_text_1 = """This is a multiline text, so it is enclosed by triple quotes.
Pretty cool stuff!
I always wanted to type more than one line, so today is my lucky day!"""
multiline_text_2 = "This is a multiline text, so it is enclosed by triple quotes.\nPretty cool stuff!\nI always wanted to type more than one line, so today is my lucky day!"
print(multiline_text_1)
print() # this just prints an empty line
print(multiline_text_2)

Internally, these two strings are represented in the same way. We can check that with the double equals sign, which checks whether two objects have the same value:


In [ ]:
print(multiline_text_1 == multiline_text_2)

From this we can conclude that multiline_text_1 contains the same hidden characters (in this case \n, which stands for 'new line') as multiline_text_2. You can show that this is indeed true by using the built-in repr() function, which returns a representation of an object in which such hidden characters are made visible.


In [ ]:
# Show the internal representation of multiline_text_1.
print(repr(multiline_text_1))
print(repr(multiline_text_2))

Another hidden character that is often used is \t, which represents tabs:


In [ ]:
colors = "yellow\tgreen\tblue\tred"
print(colors)
print(repr(colors))
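
To make the idea of hidden characters a bit more concrete, here is a small sketch (the variable name two_lines is made up for this illustration). It uses the built-in len() function, which is introduced in the next section, to show that \n counts as one single character even though we type it with two symbols.

```python
two_lines = "first\nsecond"

print(two_lines)         # printed on two lines
print(repr(two_lines))   # the newline is shown as \n
print(len(two_lines))    # 12: 5 letters + 1 newline + 6 letters
```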

2. Strings as sequences

2.1 String indices

Strings are simply sequences of characters. Each character in a string therefore has a position, which can be referred to by the index number of that position. The index numbers start at 0 and run up to the length of the string minus one. You can also count backwards from the end of the string using negative indices, starting at -1 for the last character. The following table shows all characters of the sentence "Sandwiches are yummy" in the first row. The second and third rows show the positive and negative indices of each character, respectively:

Characters       S   a   n   d   w   i   c   h   e   s   ␣   a   r   e   ␣   y   u   m   m   y
Positive index   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
Negative index -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1

(␣ marks a space character.)

You can access the characters of a string as follows:


In [ ]:
my_string = "Sandwiches are yummy"
print(my_string[1])   
print(my_string[-1])

Length: Python has a built-in function called len() that lets you compute the length of a sequence. It works like this:


In [ ]:
number_of_characters = len(my_string)
print(number_of_characters) # Note that spaces count as characters too!
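
The relation between len(), positive indices and negative indices can be sketched as follows. This is only an illustration; the names length, last_char and same_position are made up here.

```python
my_string = "Sandwiches are yummy"
length = len(my_string)

# The last valid positive index is length - 1, not length.
last_char = my_string[length - 1]
print(last_char)                 # the same character as my_string[-1]

# A negative index i refers to the same position as length + i.
same_position = my_string[-17] == my_string[length - 17]
print(same_position)             # True
```

Asking for my_string[length] would raise an IndexError, because that position lies just past the end of the string.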

2.2 String slicing

Besides using single indices we can also extract a range from a string:


In [ ]:
my_string = "Sandwiches are yummy"
print(my_string[1:4])

This is called string slicing. So how does this notation work?

my_string[i]                  # Get the character at index i.
my_string[start:end]          # Get the substring starting at 'start' and ending *before* 'end'.
my_string[start:end:stepsize] # Get all characters starting from 'start', ending before 'end', 
                              # with a specific step size.

You can also leave parts out:

my_string[:i]                 # Get the substring starting at index 0 and ending just before i.
my_string[i:]                 # Get the substring starting at i and running all the way to the end.
my_string[::i]                # Get a string going from start to end with step size i.

You can also use a negative step size, which walks through the string from right to left. my_string[::-1] is the idiomatic way to reverse a string.
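
Here is a short sketch of the notation in action. It uses a made-up string called letters, so that it does not give away the answers to the questions that follow. The comments show the value of each slice.

```python
letters = "abcdefgh"

first_three = letters[:3]          # 'abc'
every_other = letters[::2]         # 'aceg'
backwards = letters[::-1]          # 'hgfedcba'
middle_reversed = letters[5:1:-1]  # 'fedc': from index 5 down to, but not including, index 1
too_far = letters[5:100]           # 'fgh': a slice may run past the end without an error

print(first_three, every_other, backwards, middle_reversed, too_far)
```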

Do you know what the following statements will print?


In [ ]:
print(my_string[1:4])

In [ ]:
print(my_string[1:4:1])

In [ ]:
print(my_string[11:14])

In [ ]:
print(my_string[15:])

In [ ]:
print(my_string[:9])

In [ ]:
print('cow'[::2])

In [ ]:
print('cow'[::-2])

3. Immutability

The mutability of an object refers to whether an object can change or not. Strings are immutable, meaning that they cannot be changed. It is possible to create a new string based on the old one, but we cannot modify the string in place. The cells below demonstrate this.


In [ ]:
# This is fine, because we are creating a new string
fruit = 'guanabana'
island = fruit[:5] 
print(island, 'island')

In [ ]:
# This works because we are creating a new string and overwriting our old one
fruit = fruit[5:] + 'na' 
print(fruit)

In [ ]:
# This does not work because now we are trying to change an existing string
fruit[4:5] = 'an' 
print(fruit)

In [ ]:
# To get a changed string, build a new one from a slice and reassign the variable:
fruit = fruit[:4] + 'an'
print(fruit)

The reasons why strings are immutable are beyond the scope of this notebook. Just remember that if you want to modify a string, you need to create a new string and assign it to the variable; you cannot modify parts of a string by assigning to individual indices or slices.
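
A common pattern for "changing" the middle of a string is to glue together the part before the change, the new text, and the part after the change. The sketch below is one possible way to do this; the variable name changed is made up for the illustration.

```python
fruit = "banana"

# Replace the character at index 4 with 'an' by building a new string:
# everything before index 4 + the new text + everything from index 5 onwards.
changed = fruit[:4] + "an" + fruit[5:]

print(changed)   # banaana
print(fruit)     # banana (the original string is unchanged)
```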

4. Comparing strings

In Python it is possible to use comparison operators (as used in conditional statements) on strings. These operators are: ==, !=, <, <=, >, and >=.

String comparison is always case-sensitive. Some of the comparison operations (greater/smaller than) are useful for putting words in lexicographical order. This is similar to the alphabetical order you would find in a dictionary, except that all the uppercase letters come before all the lowercase letters (so first A, B, C, etc. and then a, b, c, etc.).


In [ ]:
print('a' == 'a')
print('a' != 'b')
print('a' == 'A')  # string comparison is case-sensitive
print('a' < 'b')   # alphabetical order
print('A' < 'a')   # uppercase comes before lowercase
print('B' < 'a')   # uppercase comes before lowercase
print()
print('orange' == 'Orange')
print('orange' > 'Orange')
print('orange' < 'Orange')
print('orange' > 'banana')
print('Orange' > 'banana')
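
Why do uppercase letters come first? Python compares strings character by character, using the number (Unicode code point) that stands for each character. The sketch below uses the built-in ord() function, which is not covered in this chapter, to show those numbers. It also shows one possible way to compare words while ignoring case, using the lower() method that is introduced later in this notebook.

```python
# ord() returns the number that represents a single character.
print(ord("A"), ord("B"), ord("a"))   # 65 66 97

case_sensitive = "Orange" > "banana"                    # 'O' (79) is smaller than 'b' (98)
case_insensitive = "Orange".lower() > "banana".lower()  # compares 'orange' with 'banana'

print(case_sensitive)     # False
print(case_insensitive)   # True
```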

Another way of comparing strings is to check whether a string is part of another string, which can be done using the in operator. It returns True if the string contains the relevant substring, and False if it doesn't. These two values (True and False) are called boolean values, or booleans for short. We'll talk about them in more detail later. Here are some examples to try:


In [ ]:
"fun" in "function"

In [ ]:
"I" in "Team"

In [ ]:
"App" in "apple" # Capitals are not the same as lowercase characters!

5. Printing, concatenating and inserting strings

You will often find yourself concatenating and printing combinations of strings. Consider the following examples:


In [ ]:
print("Hello", "World")
print("Hello " + "World")

Even though they may look similar, there are two different things happening here. Simply said: the plus in the expression is doing concatenation, but the comma is not doing concatenation.

The print() function, which we have seen many times now, accepts a comma-separated sequence of expressions, converts each of them to a string, and prints them to your screen, separated by a single space by default. Note that you can mix types: anything that is not already a string is automatically converted to its string representation.


In [ ]:
number = 5
print("I have", number, "apples")
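
The single space is only the default. As a small sketch, print() also accepts the keyword arguments sep (what to put between the values) and end (what to put after the last value, a newline by default):

```python
number = 5

print("I have", number, "apples")             # I have 5 apples
print("I have", number, "apples", sep="-")    # I have-5-apples
print("I have", number, "apples", sep="")     # I have5apples

# With end=" " the next print() continues on the same line.
print("I have", number, end=" ")
print("apples")                               # I have 5 apples
```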

String concatenation, on the other hand, happens when we merge two strings into a single object using the + operator. No spaces are inserted, and you cannot concatenate mixed types. So, if you want to merge a string and an integer, you first need to convert the integer to a string.


In [ ]:
number = 5
print("I have " + str(number) + " apples")

Optionally, we can assign the concatenated string to a variable:


In [ ]:
my_string = "I have " + str(number) + " apples"
print(my_string)

In addition to using + to concatenate strings, we can also use the multiplication sign * in combination with an integer for repeating strings (note that we again need to add a blank after 'apples' if we want it to be inserted):


In [ ]:
my_string = "apples " * 5
print(my_string)
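
One place where repetition comes in handy is drawing a line that is exactly as long as a piece of text. The sketch below is just an illustration; the names title and underline are made up.

```python
title = "Chapter 3 - Strings"
underline = "-" * len(title)   # repeat the dash once for every character in the title

print(title)
print(underline)

# Multiplying by 0 gives an empty string.
nothing = "apples " * 0
print(repr(nothing))           # ''
```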

The difference between "," and "+" when printing and concatenating strings can be confusing at first. Have a look at these examples to get a better sense of their differences.


In [ ]:
print("Hello", "World")

In [ ]:
print("Hello" + "World")

In [ ]:
print("Hello " + "World")

In [ ]:
print(5, "eggs")

In [ ]:
print(str(5), "eggs")

In [ ]:
print(5 + " eggs")

In [ ]:
print(str(5) + " eggs")

In [ ]:
text = "Hello" + "World"
print(text)
print(type(text))

In [ ]:
text = "Hello", "World"
print(text)
print(type(text))

5.1 Using f-strings

We can imagine that string concatenation can get rather confusing and unreadable if we have more variables. Consider the following example:


In [ ]:
name = "Chantal"
age = 27
country = "The Netherlands"

introduction = "Hello. My name is " + name + ". I'm " +  str(age) + " years old and I'm from " + country + "."
print(introduction)

Luckily, there is a way to make the code much easier to read and more nicely formatted. In Python, you can use a string formatting mechanism called Literal String Interpolation. Strings that are formatted using this mechanism are called f-strings, after the leading character f used to denote such strings, which stands for "formatted strings". It works as follows:


In [ ]:
name="Chantal" 
age=27
country="The Netherlands"
introduction = f"Hello. My name is {name}. I'm {age} years old and I'm from {country}."
introduction

F-strings can also evaluate expressions inside the curly braces:


In [ ]:
text = f"Next year, I'm turning {age+1} years old."
print(text)

Other formatting methods that you may come across include %-formatting and str.format(), but we recommend that you use f-strings because they are the most intuitive.
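
So that you can recognize the other two mechanisms when you come across them, here is a rough side-by-side sketch in which all three produce the same string. The last lines show a format specification inside an f-string, which is an extra that goes slightly beyond this chapter; the variables price and rounded are made up for the illustration.

```python
name = "Chantal"
age = 27

with_percent = "My name is %s and I'm %d years old." % (name, age)
with_format = "My name is {} and I'm {} years old.".format(name, age)
with_fstring = f"My name is {name} and I'm {age} years old."

print(with_fstring)
print(with_percent == with_format == with_fstring)   # True

# After a colon you can say how a value should be formatted,
# for example as a number with two decimals.
price = 3.14159
rounded = f"{price:.2f}"
print(rounded)                                        # 3.14
```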

6. String methods

A method is a function that is associated with an object. For example, the string method lower() turns a string into all lowercase characters, and the string method upper() makes a string uppercase. You can call these methods using dot notation, as shown below:


In [ ]:
string_1 = 'Hello, world!'
print(string_1)         # The original string.
print(string_1.lower()) # Lowercased.
print(string_1.upper()) # Uppercased.

6.1 Learning about methods

So how do you find out what kind of methods an object has? There are two options:

  1. Read the documentation. The official Python documentation has a section on string methods.
  2. Use the dir() function, which returns a list of method names (as well as attributes of the object). If you want to know what a specific method does, use the help() function.

Run the code below to see what the output of dir() looks like.

The names that start and end with double underscores ('dunder methods') are Python-internal. They are what makes built-in functions like len() work (len() internally calls the string's __len__() method), and they tell Python what to do when you, for example, use a for-loop with a string.

The other method names indicate common and useful methods.


In [ ]:
# Run this cell to see all methods for strings
dir(str)
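
The output of dir() is rather long. If you only want to see the common, non-internal names, one possible approach is to skip every name that starts with a double underscore. The sketch below uses a list and a for-loop, which are covered in later chapters; the name public_methods is made up for the illustration.

```python
public_methods = []
for name in dir(str):
    if not name.startswith("__"):   # skip the Python-internal dunder names
        public_methods.append(name)

print(public_methods)

# len() and the dunder method __len__() give the same answer.
print(len("cheese") == "cheese".__len__())   # True
```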

If you'd like to know what one of these methods does, you can just use help() (or look it up online):


In [ ]:
help(str.upper)

It's important to note that string methods only return the result. They do not change the string itself.


In [ ]:
x = 'test'    # Defining x.
y = x.upper() # Using x.upper(), assigning the result to variable y.
print(y)      # Print y.
print(x)      # Print x. It is unchanged.
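
Because every string method returns a new string, you can call another method directly on the result. This is sometimes called method chaining. A minimal sketch, using the strip() method that is demonstrated further down and two made-up variables raw and cleaned:

```python
raw = "  Hello, World!  "

# strip() returns a new string without the surrounding spaces,
# and lower() is then called on that new string.
cleaned = raw.strip().lower()

print(repr(cleaned))   # 'hello, world!'
print(repr(raw))       # '  Hello, World!  ' (the original is unchanged)
```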

Below we illustrate some of the string methods. Try to understand what is happening. Use the help() function to find more information about each of these methods.


In [ ]:
# Find out more about each of the methods used below by changing the name of the method
help(str.strip)

In [ ]:
s = ' Humpty Dumpty sat on the wall '
print(s)
s = s.strip() 
print(s)

print(s.upper())
print(s.lower())

print(s.count("u"))
print(s.count("U"))

print(s.find('sat'))
print(s.find('t', 12))
print(s.find('q', 12))

print(s.replace('sat on', 'fell off'))

words = s.split()    # This returns a list, which we will talk about later.
for word in words:   # But you can iterate over each word in this manner
    print(word.capitalize())

print('-'.join(words))
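
One detail in the cell above is easy to miss: find() returns the index at which the substring starts, and it returns -1 (rather than raising an error) when the substring does not occur. The sketch below illustrates this; the names position and missing are made up. If you only want to know whether a substring occurs at all, the in operator from section 4 is usually the clearer choice.

```python
s = "Humpty Dumpty sat on the wall"

position = s.find("sat")
print(position)                    # 14
print(s[position:position + 3])    # sat

missing = s.find("q")
print(missing)                     # -1, meaning 'not found'

# Careful: -1 is also a valid index (the last character),
# so check the result before using it as an index.
print("q" in s)                    # False
```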

Exercises

Exercise 1:

Can you identify and explain the errors in the following lines of code? Correct them please!


In [ ]:
print("A message").
print("A message')
print('A message"')

Exercise 2:

Can you print the following? Try using both positive and negative indices.

  • the letter 'd' in my_string
  • the letter 'c' in my_string

In [ ]:
my_string = "Sandwiches are yummy"
# your code here

Can you print the following? Try using both positive and negative indices.

  • make a new string containing your first name and print its first letter
  • print the number of letters in your name

In [ ]:
# your code here

Exercise 3:

Can you print all a's in the word 'banana'?


In [ ]:
# your code here

Can you print 'banana' in reverse ('ananab')?


In [ ]:
# your code here

Can you exchange the first and last characters in my_string ('aananb')? Create a new variable new_string to store your result.


In [ ]:
my_string = "banana"
new_string = # your code here

Exercise 4:

Find a way to fix the spacing problem below keeping the "+".


In [ ]:
name = "Bruce Banner"
alterego = "The Hulk"
colour = "Green"
country = "USA"

print("His name is" + name + "and his alter ego is" + alterego + 
      ", a big" + colour + "superhero from the" + country + ".")

How would you print the same sentence using ","?


In [ ]:
name = "Bruce Banner"
alterego = "The Hulk"
colour = "Green"
country = "USA"

print("His name is" + name + "and his alter ego is" + alterego + 
      ", a big" + colour + "superhero from the" + country + ".")

Can you rewrite the code below using an f-string?


In [ ]:
name = "Bruce Banner"
alterego = "The Hulk"
colour = "green"
country = "the USA"
birth_year = 1969
current_year = 2017

print("His name is " + name + " and his alter ego is " + alterego + 
      ", a big " + colour + " superhero from " + country + ". He was born in " + str(birth_year) + 
      ", so he must be " + str(current_year - birth_year - 1) + " or " + str(current_year - birth_year) + 
      " years old now.")

Exercise 5:

Replace all a's by o's in 'banana' using a string method.


In [ ]:
my_string = "banana"
# your code here

Remove all spaces in the sentence using a string method.


In [ ]:
my_string = "Humpty Dumpty sat on the wall"
# your code here

What do the methods lstrip() and rstrip() do? Try them out below.


In [ ]:
# find out what lstrip() and rstrip() do

What do the methods startswith() and endswith() do? Try them out below.


In [ ]:
# find out what startswith() and endswith() do